Predicting Belgian Real Estate Prices: Part 3: Characterizing Belgian Real Estate Data Post-Scraping

Predicting Belgian Real Estate Prices
Author

Adam Cseresznye

Published

November 5, 2023

Photo by Stephen Phillips - Hostreviews.co.uk on UnSplash

In the second part, we delved into the intricacies of data processing necessary after scraping. Our discussion covered the treatment of numerical data, handling of categorical variables, and management of boolean values. Furthermore, we evaluated the data quality by scrutinizing the error log produced by the Immowebscraper class. In the upcoming Part 3, our focus will shift to getting a fundamental overview of our data by characterizing the cleaned scraped data. Additionally, we aim to assess feature cardinality, scrutinize distributions, and explore potential correlations between features and our target variable—property price.

Note

You can access the project’s app through its Streamlit website.

Import data

Code
import time
from pathlib import Path

import creds
import numpy as np
import pandas as pd
from data import pre_process, utils
from lets_plot import *
from lets_plot.bistro.corr import *
from lets_plot.geo_data import *
from lets_plot.mapping import as_discrete
from tqdm import tqdm

LetsPlot.setup_html(offline=True)
The geodata is provided by © OpenStreetMap contributors and is made available here under the Open Database License (ODbL).

Data Inspection

Our initial step involves reading the cleaned-up dataset that was initially scraped. This dataset is stored in the interim folder following the cleaning process, and we will access it from there.

Code
df = pd.read_parquet(
    utils.Configuration.INTERIM_DATA_PATH.joinpath(
        "2023-10-01_Processed_dataset_for_NB_use.parquet.gzip"
    )
)
df.head().style.set_sticky(axis=0)
  as_built_plan available_as_of basement bathrooms bedroom_1_surface bedroom_2_surface bedroom_3_surface bedrooms building_condition co2_emission cadastral_income connection_to_sewer_network construction_year covered_parking_spaces dining_room double_glazing energy_class external_reference flood_zone_type furnished garden_surface gas_water__electricity heating_type kitchen_surface kitchen_type latest_land_use_designation living_area living_room_surface number_of_frontages office outdoor_parking_spaces planning_permission_obtained possible_priority_purchase_right price primary_energy_consumption proceedings_for_breach_of_planning_regulations reference_number_of_the_epc_report street_frontage_width subdivision_permit surface_of_the_plot surroundings_type tv_cable tenement_building toilets website width_of_the_lot_on_the_street yearly_theoretical_total_energy_consumption ad_url day_of_retrieval housenumber street city postal state lat lng
0 nan After signing the deed 1.000000 4.000000 28.000000 24.000000 21.000000 5.000000 Good 74.000000 7615.000000 0.000000 1975.000000 2.000000 1.000000 1.000000 D 5530019 nan 0.000000 4300.000000 1.000000 Fuel oil 21.000000 USA hyper equipped None 420.000000 63.000000 4.000000 1.000000 2.000000 nan nan 1225000.000000 296.000000 nan 20210910012683 23.000000 nan 4677.000000 Living area (residential, urban or rural) 1.000000 0.000000 5.000000 http://www.bytheway.be nan 133062.000000 https://www.immoweb.be/en/classified/villa/for-sale/braine-l%27alleud/1420/10844811 2023-09-27 14:02:10.370650 96 Drève Richelle Waterloo 1410 Région Wallonne 50.708475 4.407577
1 0.000000 After signing the deed nan 1.000000 14.000000 10.000000 nan 2.000000 To renovate 76.000000 541.000000 1.000000 1850.000000 1.000000 1.000000 1.000000 D 5246367 nan 0.000000 nan nan Fuel oil nan Semi equipped Living area (residential, urban or rural) 113.000000 nan 2.000000 nan nan nan 0.000000 89000.000000 307.000000 0.000000 20230306014621 9.000000 0.000000 73.000000 Urban 1.000000 0.000000 1.000000 http://www.weinvest.be 10.000000 43587.000000 https://www.immoweb.be/en/classified/town-house/for-sale/jemelle/5580/10843929 2023-09-27 14:02:10.641758 65 Chau. de l'Ourthe Marche-en-Famenne 6900 Région Wallonne 50.231363 5.350163
2 nan After signing the deed 1.000000 1.000000 17.000000 11.000000 nan 2.000000 To renovate 173.000000 689.000000 1.000000 1949.000000 1.000000 1.000000 1.000000 G 5534704 0.000000 0.000000 315.000000 nan Fuel oil 13.000000 Semi equipped Living area (residential, urban or rural) 139.000000 10.000000 3.000000 1.000000 nan nan 0.000000 150000.000000 699.000000 0.000000 20230303013078 18.000000 0.000000 413.000000 Isolated 1.000000 0.000000 1.000000 http://www.nigel-immo.be 18.000000 96913.000000 https://www.immoweb.be/en/classified/house/for-sale/esneux%20tilff/4130/10838582 2023-09-27 14:02:10.905808 2a Rue de la Wallonie Oupeye 4680 Région Wallonne 50.708887 5.622615
3 nan After signing the deed nan 2.000000 14.000000 14.000000 10.000000 3.000000 Good 4167.000000 898.000000 1.000000 1899.000000 nan nan 1.000000 C 5535455 0.000000 0.000000 nan 1.000000 Gas 17.000000 USA installed Living area (residential, urban or rural) 119.000000 14.000000 2.000000 nan nan 1.000000 1.000000 272000.000000 246.000000 0.000000 20230622-0002923649-RES-1 4.000000 0.000000 55.000000 Urban 1.000000 0.000000 2.000000 http://www.immodavinci.be nan nan https://www.immoweb.be/en/classified/house/for-sale/gent/9000/10838392 2023-09-27 14:02:11.811691 1 Sint-Denijslaan Gent 9000 Vlaams Gewest 51.033681 3.715034
4 nan At delivery nan 1.000000 16.000000 12.000000 12.000000 3.000000 As new nan nan 1.000000 2021.000000 nan nan 1.000000 A 5527171 0.000000 0.000000 250.000000 1.000000 Gas nan USA hyper equipped Living area (residential, urban or rural) 215.000000 46.000000 3.000000 nan 1.000000 1.000000 0.000000 413150.000000 27.000000 nan 12345 7.000000 1.000000 330.000000 Living area (residential, urban or rural) 1.000000 0.000000 2.000000 http://www.living-stone.be nan nan https://www.immoweb.be/en/classified/house/for-sale/affligem/1790/10842333 2023-09-27 14:02:11.968969 1 Astridlaan Dilbeek 1700 Vlaams Gewest 50.850220 4.265908

To examine the data types of the columns, we will utilize the info() method.

Code
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3660 entries, 0 to 3659
Data columns (total 56 columns):
 #   Column                                          Non-Null Count  Dtype         
---  ------                                          --------------  -----         
 0   as_built_plan                                   1722 non-null   float64       
 1   available_as_of                                 2134 non-null   object        
 2   basement                                        1707 non-null   float64       
 3   bathrooms                                       3472 non-null   float32       
 4   bedroom_1_surface                               2495 non-null   float32       
 5   bedroom_2_surface                               2439 non-null   float32       
 6   bedroom_3_surface                               2005 non-null   float32       
 7   bedrooms                                        3641 non-null   float32       
 8   building_condition                              3515 non-null   object        
 9   co2_emission                                    1575 non-null   float32       
 10  cadastral_income                                2865 non-null   float32       
 11  connection_to_sewer_network                     1635 non-null   float64       
 12  construction_year                               2598 non-null   float32       
 13  covered_parking_spaces                          2009 non-null   float32       
 14  dining_room                                     1073 non-null   float64       
 15  double_glazing                                  3096 non-null   float64       
 16  energy_class                                    3658 non-null   object        
 17  external_reference                              3305 non-null   object        
 18  flood_zone_type                                 1991 non-null   float64       
 19  furnished                                       1773 non-null   float64       
 20  garden_surface                                  1736 non-null   float32       
 21  gas_water__electricity                          2454 non-null   float64       
 22  heating_type                                    3191 non-null   object        
 23  kitchen_surface                                 2129 non-null   float32       
 24  kitchen_type                                    3054 non-null   object        
 25  latest_land_use_designation                     1725 non-null   object        
 26  living_area                                     3595 non-null   float32       
 27  living_room_surface                             2367 non-null   float32       
 28  number_of_frontages                             3463 non-null   float32       
 29  office                                          1080 non-null   float64       
 30  outdoor_parking_spaces                          1757 non-null   float32       
 31  planning_permission_obtained                    1176 non-null   float64       
 32  possible_priority_purchase_right                1667 non-null   float64       
 33  price                                           3660 non-null   float64       
 34  primary_energy_consumption                      3512 non-null   float32       
 35  proceedings_for_breach_of_planning_regulations  1328 non-null   float64       
 36  reference_number_of_the_epc_report              3658 non-null   object        
 37  street_frontage_width                           1701 non-null   float32       
 38  subdivision_permit                              1398 non-null   float64       
 39  surface_of_the_plot                             3536 non-null   float32       
 40  surroundings_type                               1861 non-null   object        
 41  tv_cable                                        1244 non-null   float64       
 42  tenement_building                               3628 non-null   float64       
 43  toilets                                         3325 non-null   float32       
 44  website                                         3096 non-null   object        
 45  width_of_the_lot_on_the_street                  1553 non-null   float32       
 46  yearly_theoretical_total_energy_consumption     1447 non-null   float32       
 47  ad_url                                          3660 non-null   object        
 48  day_of_retrieval                                3660 non-null   datetime64[us]
 49  housenumber                                     3577 non-null   object        
 50  street                                          3612 non-null   object        
 51  city                                            3634 non-null   object        
 52  postal                                          3641 non-null   object        
 53  state                                           3641 non-null   object        
 54  lat                                             3641 non-null   float64       
 55  lng                                             3641 non-null   float64       
dtypes: datetime64[us](1), float32(21), float64(18), object(16)
memory usage: 1.3+ MB

Assessing Feature Cardinality

Now, let’s assess the feature cardinality of our dataset to differentiate between categorical and numerical data. Initially, we will analyze the percentage of unique values per feature, followed by displaying the absolute number of unique values per feature.

Code
# Assuming df is your DataFrame
number_unique_entries = {
    "column_name": df.columns.tolist(),
    "column_dtype": [df[col].dtype for col in df.columns],
    "unique_values_pct": [df[col].nunique() for col in df.columns],
}

(
    pd.DataFrame(number_unique_entries)
    .sort_values("unique_values_pct")
    .assign(
        unique_values_pct=lambda x: x.unique_values_pct.div(df.shape[0])
        .mul(100)
        .round(1)
    )
    .pipe(
        lambda df: ggplot(df, aes("unique_values_pct", "column_name"))
        + geom_bar(stat="identity", orientation="y")
        + labs(
            title="Assessing Feature Cardinality",
            subtitle=""" Features with a Low Cardinality (Less than 10 Distinct Values) Can Be  Utilized as Categorical Variables, 
            while Those with Higher Cardinality, typically represented as floats or integers, May Be Used as They Are
            """,
            x="Percentage of Unique Values per Feature",
            y="",
            caption="https://www.immoweb.be/",
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
        + ggsize(800, 1000)
    )
)
Figure 1: Assessing Feature Cardinality: Percentage of Unique Values per Feature
Code
(
    pd.DataFrame(number_unique_entries)
    .sort_values("unique_values_pct")
    .pipe(
        lambda df: ggplot(df, aes("unique_values_pct", "column_name"))
        + geom_bar(stat="identity", orientation="y")
        + labs(
            title="Assessing Feature Cardinality",
            subtitle=""" Features with a Low Cardinality (Less than 10 Distinct Values) Can Be  Utilized as Categorical Variables, 
            while Those with Higher Cardinality, typically represented as floats or integers, May Be Used as They Are
            """,
            x="Number of Unique Values per Feature",
            y="",
            caption="https://www.immoweb.be/",
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
        + ggsize(800, 1000)
    )
)
Figure 2: Assessing Feature Cardinality: Number of Unique Values per Feature

Looking at distributions

Distribution of features

Next, our focus will be on identifying low and high cardinality features. Subsequently, we will investigate how house prices vary when grouped according to these variables, using boxplots. Please take into account that the price values have undergone log transformation to address skewness.

Code
low_cardinality_features = (
    pd.DataFrame(number_unique_entries)
    .query("unique_values_pct <= 5")
    .column_name.to_list()
)
Code
high_cardinality_features = (
    pd.DataFrame(number_unique_entries)
    .query("(unique_values_pct >= 5)")
    .loc[lambda df: (df.column_dtype == "float32") | (df.column_dtype == "float64"), :]
    .column_name.to_list()
)
Code
plots = []

for idx, feature in enumerate(low_cardinality_features):
    plot = (
        df.melt(id_vars=["ad_url", "price"])
        .loc[lambda df: df.variable == feature, :]
        .assign(price=lambda df: np.log10(df.price))
        .pipe(
            lambda df: ggplot(
                df,
                aes(as_discrete("value"), "price"),
            )
            + facet_wrap("variable")
            + geom_boxplot(
                show_legend=False,
            )
        )
    )
    plots.append(plot)
gggrid(plots, ncol=4) + ggsize(900, 1600)
Figure 3: Exploring Price Variations Across Different Variables

Distribution of target

Upon examining the distribution of our target variable, which is the price, it becomes evident that there is a notable skew. Our median price stands at 379,000 EUR, with the lowest at 350,000 EUR and the highest reaching 10 million EUR. To increase the accuracy of our predictions, it is worth considering a transformation of our target variable before proceeding with modeling. This transformation serves several beneficial purposes:

  1. Normalization: It has the potential to render the distribution of the target variable more symmetrical, resembling a normal distribution. Such a transformation can significantly enhance the performance of various regression models.

  2. Equalizing Variance: By stabilizing the variance of the target variable across different price ranges, this transformation becomes particularly valuable for ensuring the effectiveness of certain regression algorithms.

  3. Mitigating Outliers: It is effective at diminishing the impact of extreme outliers, bolstering the model’s robustness against data anomalies.

  4. Interpretability: Notably, when interpreting model predictions, this transformation allows for straightforward back-transformation to the original scale. This can be achieved using a base 10 exponentiation, ensuring that predictions are easily interpretable in their origination task.

Code
before_transformation = df.pipe(
    lambda df: ggplot(df, aes("price")) + geom_histogram()
) + labs(
    title="Before Transformation",
)
after_transformation = df.assign(price=lambda df: np.log10(df.price)).pipe(
    lambda df: ggplot(df, aes("price"))
    + geom_histogram()
    + labs(
        title="After log10 Transformation",
    )
)
gggrid([before_transformation, after_transformation], ncol=2) + ggsize(800, 500)
Figure 4: Target distribution before and after log10 transformation

Geomapping

Next, let’s explore how median house prices vary across different locations. As indicated by the map below, Knokke-Heist boasts one of the highest median apartment prices, closely followed by Watermael-Boitsfort. This aligns with expectations, as per Wikipedia: “Watermael-Boitsfort is one of Brussels’ most affluent municipalities, with an average per capita income of €30,100 in 2002, exceeding the regional average for the Brussels-Capital Region by over €600.” On the other hand, Knokke-Heist is renowned as “It is Belgium’s best-known and most affluent seaside resort.” There you have it.

Code
grouped_city = (
    df.groupby("city")
    .price.median()
    .to_frame()
    .reset_index()
    .query("~city.isin(['Blégny', 'Woluwe-St.-Lambert', 'Wolvertem'])")
)


boundaries = (
    geocode_cities(grouped_city.city)
    .scope("Belgium")
    .inc_res(4)
    .get_boundaries(resolution=6)
)

(
    ggplot()
    + geom_livemap()
    + geom_polygon(
        aes(color="price", fill="price"),
        data=grouped_city,
        map=boundaries,
        color="#dbd6d6",
        alpha=0.9,
        size=0.5,
        map_join=[["city"], ["city"]],
        tooltips=layer_tooltips(["city", "price"]),
    show_legend=False
    )
    + labs(
        "Median House Prices in Belgium Cities",
        subtitle="Knokke-Heist boasts one of the highest median apartment prices, closely followed by Watermael-Boitsfort.",
        caption="https://www.immoweb.be/",
    )
    + theme(
        axis_title="blank",
        axis_text="blank",
        axis_ticks="blank",
        axis_line="blank",
        plot_subtitle=element_text(size=12, face="italic"),
        plot_title=element_text(size=15, face="bold"),
    )
    + scale_fill_gradient(low="green", high="red")
    + labs(fill="Median Price")
    + ggsize(1000, 800)
)

Correlations

Finally, we will delve into the correlations among variables with high cardinality through Spearman correlation analysis, with a particular focus on the price as the target variable. As evident from the heatmap, the price exhibits a strong correlation with cadastral income (correlation coefficient = 0.77), living area (correlation coefficient = 0.74), and bathrooms (correlation coefficient = 0.59). For your reference, cadastral income is an annual Flemish property tax based on the assessed rental income of immovable properties in the Flemish Region. This income is a notional rental income assigned to each property, whether it is rented out or not.

Code
(
    df.loc[:, lambda df: df.columns.isin(high_cardinality_features)]
    .corr(method="spearman")
    .pipe(
        lambda df: corr_plot(df)
        .tiles(
            "lower",
        )
        .labels(type="lower", map_size=False)
        .palette_gradient(low="#2986cc", mid="#ffffff", high="#d73027")
        .build()
        + ggsize(900, 900)
        + labs(
            title="Spearman Correlations Among High Cardinality Features",
            subtitle=""" The price demonstrates robust correlations with key factors, including cadastral income (correlation coefficient = 0.77), 
            living area (correlation coefficient = 0.74), and bathrooms (correlation coefficient = 0.59)
            """,
            x="Number of Unique Values per Feature",
            y="",
            caption="https://www.immoweb.be/",
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
    )
)
Figure 5: Spearman Correlations Among High Cardinality Features

In part 3, we’ve performed some initial data exploration, and our next stride entails constructing a foundational machine learning model. In part 4, we’ll compare various algorithms to establish a benchmark for our subsequent endeavors in model building and feature engineering. This baseline model will serve as a reference point, guiding us as we strive to enhance our predictive capabilities. Looking forward to meeting you again in Part 4!